notes/mit6.824/html/6.824 Lab 1_ MapReduce.html

<!DOCTYPE html>
<!-- saved from url=(0052)http://nil.csail.mit.edu/6.824/2020/labs/lab-mr.html -->
<html><head><meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<title>6.824 Lab 1: MapReduce</title>
<link rel="stylesheet" href="./assets/style.css" type="text/css">
</head>
<body>
<div align="center">
<h2><a href="http://nil.csail.mit.edu/6.824/2020/index.html">6.824</a> - Spring 2020</h2>
<h1>6.824 Lab 1: MapReduce</h1>
<h3>Due: Feb 14 23:59</h3>
</div>
<hr>

<h3>Introduction</h3>

<p>
In this lab you'll build a MapReduce system.
You'll implement a worker process that calls application Map and Reduce
functions and handles reading and writing files,
and a master process that hands out tasks to
workers and copes with failed workers.
You'll be building something similar to the
<a href="http://research.google.com/archive/mapreduce-osdi04.pdf">MapReduce paper</a>.

</p><h3>Collaboration Policy</h3>
<p>
You must write all the code you hand in for 6.824, except for code
that we give you as part of assignments. You are not allowed to
look at anyone else's solution, and you are not allowed to look at
solutions from previous years. You may discuss the assignments with
other students, but you may not look at or copy each others' code. The
reason for this rule is that we believe you will learn the most by
designing and implementing your lab solution yourself.

</p><p>
Please do not publish your code or make
it available to current or future 6.824 students.
<tt>github.com</tt> repositories are public by default, so please
don't put your code there unless you make the repository private. You
may find it convenient to use
<a href="https://github.mit.edu/">MIT's GitHub</a>,
but be sure to create a private repository.

</p><h3>Software</h3>
<p>
You'll implement this lab (and all the labs) in
<a href="http://www.golang.org/">Go</a>. The Go web site contains lots
of tutorial information.
We will grade your
labs using Go version 1.13; you should use 1.13 too. You can check your Go
version by running <tt>go version</tt>.

</p><p>
We recommend that you work on the labs on your own machine, so you can use the
tools, text editors, etc. that you are already familiar with. Alternatively,
you can work on the labs on Athena.

</p><h4>macOS</h4>

<p>
You can use <a href="https://brew.sh/">Homebrew</a> to install Go. After
installing Homebrew, run <tt>brew install go</tt>.

</p><h4>Linux</h4>

<p>
Depending on your Linux distribution, you might be able to get an up-to-date
version of Go from the package repository, e.g. by running <tt>apt install
golang</tt>. Otherwise, you can manually install a binary from Go's website.
First, make sure that you're running a 64-bit kernel (<tt>uname -a</tt> should
mention "x86_64 GNU/Linux"), and then run:

</p><pre>$ wget -qO- https://dl.google.com/go/go1.13.6.linux-amd64.tar.gz | sudo tar xz -C /usr/local
</pre>

You'll need to make sure <tt>/usr/local/bin</tt> is on your <tt>PATH</tt>.

<h4>Windows</h4>

<p>
The labs probably won't work directly on Windows. If you're feeling
adventurous, you can try to get them running inside <a href="https://docs.microsoft.com/en-us/windows/wsl/install-win10">Windows
Subsystem for Linux</a> and following the Linux instructions above. Otherwise,
you can fall back to Athena.

</p><h4>Athena</h4>

<p>
You can log into a public Athena host with <tt>ssh {your
kerberos}@athena.dialup.mit.edu</tt>. Once you're logged in, to get Go 1.13,
run:

</p><pre>$ setup ggo
</pre>

<h3>Getting started</h3>

<p>
You'll fetch the initial lab software with
<a href="https://git-scm.com/">git</a> (a version control system).
To learn more about git, look at the
<a href="https://git-scm.com/book/en/v2">Pro Git book</a> or the
<a href="http://www.kernel.org/pub/software/scm/git/docs/user-manual.html">git user's manual</a>.
To fetch the 6.824 lab software:

</p><pre>$ git clone git://g.csail.mit.edu/6.824-golabs-2020 6.824
$ cd 6.824
$ ls
Makefile src
$
</pre>

<p>
We supply you with a simple sequential mapreduce implementation in
<tt>src/main/mrsequential.go</tt>. It runs the maps and reduces one at a
time, in a single process. We also
provide you with a couple of MapReduce applications: word-count
in <tt>mrapps/wc.go</tt>, and a text indexer
in <tt>mrapps/indexer.go</tt>. You can run
word count sequentially as follows:

</p><pre>$ cd ~/6.824
$ cd src/main
$ go build -buildmode=plugin ../mrapps/wc.go
$ rm mr-out*
$ go run mrsequential.go wc.so pg*.txt
$ more mr-out-0
A 509
ABOUT 2
ACT 8
...
</pre>

<p>
<tt>mrsequential.go</tt> leaves its output in the file <tt>mr-out-0</tt>.
The input is from the text files named <tt>pg-xxx.txt</tt>.

</p><p>
Feel free to borrow code from <tt>mrsequential.go</tt>.
You should also have a look at <tt>mrapps/wc.go</tt> to see what
MapReduce application code looks like.

</p><h3>Your Job</h3>

Your job is to implement a distributed MapReduce, consisting of
two programs, the master and the worker. There will be
just one master process, and one or more worker processes executing in
parallel. In a real system the workers would run on a bunch of
different machines, but for this lab you'll run them all on a single machine.
The workers will talk to the master via RPC. Each worker process will ask
the master for a task, read the task's input from one or more files,
execute the task, and write the task's output to one
or more files. The master should notice if a worker hasn't completed
its task in a reasonable amount of time (for this lab, use ten
seconds), and give the same task to a different worker.

<p>
We have given you a little code to start you off. The "main" routines for
the master and worker are in <tt>main/mrmaster.go</tt> and <tt>main/mrworker.go</tt>;
don't change these files. You should put your implementation in <tt>mr/master.go</tt>,
<tt>mr/worker.go</tt>, and <tt>mr/rpc.go</tt>.

</p><p>
Here's how to run your code on the word-count MapReduce
application. First, make sure the word-count plugin is
freshly built:
</p><pre>$ go build -buildmode=plugin ../mrapps/wc.go
</pre>

In the <tt>main</tt> directory, run the master.
<pre>$ rm mr-out*
$ go run mrmaster.go pg-*.txt
</pre>

The <tt>pg-*.txt</tt> arguments to <tt>mrmaster.go</tt> are
the input files; each file corresponds to one "split", and is the
input to one Map task.

<p>
In one or more other windows, run some workers:
</p><pre>$ go run mrworker.go wc.so
</pre>

When the workers and master have finished, look at the output
in <tt>mr-out-*</tt>. When you've completed the lab, the
sorted union of the output files should match the sequential
output, like this:
<pre>$ cat mr-out-* | sort | more
A 509
ABOUT 2
ACT 8
...
</pre>

<p>
We supply you with a test script in <tt>main/test-mr.sh</tt>. The tests
check that the <tt>wc</tt> and <tt>indexer</tt> MapReduce applications
produce the correct output when given the <tt>pg-xxx.txt</tt> files as
input. The tests also check that your implementation runs the Map and
Reduce tasks in parallel, and that your implementation recovers from
workers that crash while running tasks.

</p><p>
If you run the test script now, it will hang because the master never finishes:
</p>

<pre>$ cd ~/6.824/src/main
$ sh test-mr.sh
*** Starting wc test.
</pre>

<p>
You can change <tt>ret := false</tt> to true in the Done function in <tt>mr/master.go</tt>
so that the master exits immediately. Then:
</p>

<pre>$ sh ./test-mr.sh
*** Starting wc test.
sort: No such file or directory
cmp: EOF on mr-wc-all
--- wc output is not the same as mr-correct-wc.txt
--- wc test: FAIL
$
</pre>

<p>
The test script expects to see output in files named <tt>mr-out-X</tt>, one
for each reduce task. The empty implementations of <tt>mr/master.go</tt>
and <tt>mr/worker.go</tt> don't produce those files (or do much of
anything else), so the test fails.

</p><p>
When you've finished, the test script output should look like this:

</p><pre>$ sh ./test-mr.sh
*** Starting wc test.
--- wc test: PASS
*** Starting indexer test.
--- indexer test: PASS
*** Starting map parallelism test.
--- map parallelism test: PASS
*** Starting reduce parallelism test.
--- reduce parallelism test: PASS
*** Starting crash test.
--- crash test: PASS
*** PASSED ALL TESTS
$
</pre>

<p>
You'll also see some errors from the Go RPC package that look like
</p><pre>2019/12/16 13:27:09 rpc.Register: method "Done" has 1 input parameters; needs exactly three
</pre>
Ignore these messages.

<p>
A few rules:

</p><ul>

<li>
The map phase should divide the intermediate keys into buckets for
<tt>nReduce</tt> reduce tasks,
where <tt>nReduce</tt> is the argument that
<tt>main/mrmaster.go</tt> passes to <tt>MakeMaster()</tt>.

</li><li> The worker implementation should put the output of the X'th
reduce task in the file <tt>mr-out-X</tt>.

</li><li> A <tt>mr-out-X</tt> file should contain one line per Reduce
function output. The line should be generated with the Go <tt>"%v %v"</tt>
format, called with the key and value. Have a look in <tt>main/mrsequential.go</tt>
for the line commented "this is the correct format".
The test script will fail if your implementation deviates too much from this format.

</li><li> You can modify <tt>mr/worker.go</tt>, <tt>mr/master.go</tt>, and <tt>mr/rpc.go</tt>.
You can temporarily modify other files for testing, but make sure your code works
with the original versions; we'll test with the original versions.

</li><li> The worker should put intermediate Map output in files in the current
directory, where your worker can later read them as input to Reduce tasks.

</li><li> <tt>main/mrmaster.go</tt> expects <tt>mr/master.go</tt> to implement a
<tt>Done()</tt> method that returns true when the MapReduce job is completely finished;
at that point, <tt>mrmaster.go</tt> will exit.

</li><li> When the job is completely finished, the worker processes should exit.
A simple way to implement this is to use the return value from <tt>call()</tt>:
if the worker fails to contact the master, it can assume that the master has exited
because the job is done, and so the worker can terminate too. Depending on your
design, you might also find it helpful to have a "please exit" pseudo-task
that the master can give to workers.

</li></ul>

<h3>Hints</h3>

<ul>

<li> One way to get started is to modify <tt>mr/worker.go</tt>'s
<tt>Worker()</tt> to send an RPC to the master asking for a task. Then
modify the master to respond with the file name of an as-yet-unstarted
map task. Then modify the worker to read that file and call the
application Map function, as in <tt>mrsequential.go</tt>.

</li><li> The application Map and Reduce functions are loaded at run-time
using the Go plugin package, from files whose names end in <tt>.so</tt>.

</li><li> If you change anything in the <tt>mr/</tt> directory, you will
probably have to re-build any MapReduce plugins you use, with
something like <tt>go build -buildmode=plugin ../mrapps/wc.go</tt>

</li><li> This lab relies on the workers sharing a file system.
That's straightforward when all workers run on the same machine, but would require a global
filesystem like GFS if the workers ran on different machines.

</li><li> A reasonable naming convention for intermediate files is <tt>mr-X-Y</tt>,
where X is the Map task number, and Y is the reduce task number.

</li><li> The worker's map task code will need a way to store intermediate
key/value pairs in files in a way that can be correctly read back
during reduce tasks. One possibility is to use Go's <tt>encoding/json</tt> package. To
write key/value pairs to a JSON file:
<pre>  enc := json.NewEncoder(file)
  for _, kv := ... {
    err := enc.Encode(&amp;kv)
</pre>
and to read such a file back:
<pre>  dec := json.NewDecoder(file)
  for {
    var kv KeyValue
    if err := dec.Decode(&amp;kv); err != nil {
      break
    }
    kva = append(kva, kv)
  }
</pre>

</li><li> The map part of your worker can use the <tt>ihash(key)</tt> function
(in <tt>worker.go</tt>) to pick the reduce task for a given key.

</li><li> You can steal some code from <tt>mrsequential.go</tt> for reading
Map input files, for sorting intermedate key/value pairs between the
Map and Reduce, and for storing Reduce output in files.

</li><li> The master, as an RPC server, will be concurrent; don't forget
to lock shared data.

</li><li> Use Go's race detector, with <tt>go build -race</tt> and <tt>go run -race</tt>.
<tt>test-mr.sh</tt> has a comment that shows you how to enable the
race detector for the tests.

</li><li> Workers will sometimes need to wait, e.g. reduces can't start
until the last map has finished. One possibility is for workers to
periodically ask the master for work, sleeping
with <tt>time.Sleep()</tt> between each request. Another possibility
is for the relevant RPC handler in the master to have a loop that
waits, either with <tt>time.Sleep()</tt> or <tt>sync.Cond</tt>. Go
runs the handler for each RPC in its own thread, so the fact that one
handler is waiting won't prevent the master from processing other
RPCs.

</li><li> The master can't reliably distinguish between crashed workers,
workers that are alive but have stalled for some reason,
and workers that are executing but too slowly to be useful.
The best you can do is have the master wait for
some amount of time, and then give up and re-issue the task to
a different worker. For this lab, have the master wait for
ten seconds; after that the master should assume the worker has
died (of course, it might not have).

</li><li> To test crash recovery, you can use the <tt>mrapps/crash.go</tt>
application plugin. It randomly exits in the Map and Reduce functions.

</li><li> To ensure that nobody observes partially written files in the presence of
    crashes, the MapReduce paper mentions the trick of using a temporary file
    and atomically renaming it once it is completely written. You can use
    <tt>ioutil.TempFile</tt> to create a temporary file and <tt>os.Rename</tt>
    to atomically rename it.

</li><li> <tt>test-mr.sh</tt> runs all the processes in the sub-directory
<tt>mr-tmp</tt>, so if something goes wrong and you want to look
at intermediate or output files, look there.


</li></ul>

<h3>Handin procedure</h3>

<div class="important">
<p>
Before submitting, please run <tt>test-mr.sh</tt> one final time.
</p>
</div>

<p>
Use the <tt>make lab1</tt> command to package your lab assignment and upload it
to the class's submission website, located at
<a href="https://6824.scripts.mit.edu/2020/handin.py/">https://6824.scripts.mit.edu/2020/handin.py/</a>.
</p>

<p>
You may use your MIT Certificate or request an API key via
email to log in for the first time. Your API key (<tt>XXX</tt>)
is displayed once you logged in, which can be used to upload
lab1 from the console as follows.
</p>

<pre>$ cd ~/6.824
$ echo XXX &gt; api.key
$ make lab1</pre>

<p class="important">
Check the submission website to make sure it thinks you submitted this lab!
</p>

<p class="note">
You may submit multiple times. We will use the timestamp of
your <strong>last</strong> submission for the purpose of
calculating late days.
</p>

<hr>

<address>
Please post questions on <a href="http://piazza.com/">Piazza</a>.
</address>


<!--  LocalWords:  Paxos Sharded shard sharding sharded Put's src shardmaster
-->
<!--  LocalWords:  shardkv cd TestUnreliable Go's RPCs RPC's GID px
-->
<!--  LocalWords:  kvpaxos Config ErrWrongGroup Handin gzipped czvf whoami tgz
-->

</body></html>